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Abstract 

In our work we present two parallel algorithms and their lock-free im- 
plementations using a popular GPU environment Nvidia CUD A. The first 
algorithm is the push-relabel method for the flow problem in grid graphs. The 
second is the cost scaling algorithm for the assignment problem in complete 
bipartite graphs. 

1 Introduction 

The maximum flow problem has found many applications in various com- 
puter graphics and vision problems. For example, the algorithm for the graph 
cut problem is an optimization tool for the optimal MAP (Maximum A- 
Posteriori Probability) estimation of energy functions defined over an MRF 
(Markov Random Field) [4, 11, 12, 13, 17]. Another new and most interesting 
for us idea consists in computing optical flow by reducing it to the assignment 
(weighted matching) problem in bipartite graphs ([18]). Therefore it is im- 
portant to look for new solutions to improve the execution time of algorithms 
solving the max flow and related problems. 

The new approach to acceleration of algorithms uses the power of GPU 
(graphics parallel units). This has motivated us to pose the main pur- 
pose of this research: find an efficient parallel algorithm for the weighted 
matching problem and implement it in a popular GPU environment Nvidia 
CUDA. Naturally, such an algorithm can use max flow computation tech- 
niques and the easiest method to compute maxflow in parallel is the push- 
relabel algorithm. Hence in the first part of our work we develop our own 
CUDA implementation of Hong's lock-free push-relabel algorithm. It can be 
used to find graph cuts in graphs constructed by Kolmogorov et. al. in [12] 
and is suitable for minimization of any energy function of the class charac- 
terized by these autors. 

In the second part of the paper we focus on the assignment problem, 
ie. the max weight matching problem. The cost scaling algorithm solving 
this problem has also found applications in many computer vision tasks such 
as recognition and tracking of images. Finally we present our implementation 



of the cost scaling algorithm for the assignment problem, where the core 
refine procedure is implemented lock-free on CUDA. 

This work is organized as follows. Section 2 contains all needed defi- 
nitions in our paper. Section 3 introduces the CUDA programming envi- 
ronment. Section 4 presents the sequential push-relabel algorithm and its 
parallel counterparts: a blocking version of the parallel push-relabel algo- 
rithm, presented by Vineet and Narayanan [4], and a lock-free push-relabel 
algorithm by Hong [5]. In the end of this section we present our CUDA 
implementation of the lock-free push-relabel algorithm for grid graphs. Sec- 
tion 5 starts from the presentation of two sequential algorithms: a scaling 
minimum-cost flow method and its counterpart for the assignment problem, 
both developed by Goldberg et al. [2,8,9]. Then we present our own par- 
allel cost scaling algorithm for the assignment problem using the lock-free 
push-relabel algorithm. In the end of section 5 we present our CUDA imple- 
mentation of this algorithm for arbitrary graphs. Figure 1 shows the diagram 
of the reductions between the main analyzed problems. 

2 CUDA Programming 

CUDA (Compute Unified Device Architecture) is a parallel computing 
architecture for Nvidia GPUs. In the past the GPU was used only as the co- 
processor of the CPU to reply on the many graphics tasks in real-time. Now, 
thanks to increasing computation power of GPUs, they are also very efficient 
on many data-parallel tasks. Therefore GPUs are used also for many non- 
graphics applications like the push-relabel max flow algorithm. 

We present two algorithms implemented in CUDA 4.0. Both were tested 
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for the minimum-cost flow problem 
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Cost scaling algorithm 
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for the assignment problem 
(this paper) 



Figure 1: The algorithms described in Section 5 
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on Nvidia GTX 560 Ti (i.e. on a device of compute capability 2.1). 

The programs operating on the GPU are implemented in the CUDA 
C language, an extention of C. CUDA C allows to define C functions called 
kernels that can be executed in parallel. The C program is run on the host 
(or CPU) by the host thread. The kernels (CUDA programs) are launched 
by the host thread and run on the device (or GPU) by many CUDA threads. 
The number of threads executing a kernel is defined by the programmer. 
The running threads are split into three-dimensional blocks. The set of blocks 
forms a three-dimesional grid. 
An example of a kernel declaration is: 

global void kernel_function(int* data); 

which must by called like this: 

dim3 gD = 10; /* dimension of grid, unspecified component is initialized to 1*/ 
dim3 bD = (32, 8, 1); /* dimension of block */ 
kernel_function <gCgD, bD;§> (someArray); 

Each thread executing a kernel is given the following built-in three- 
dimensional variables: 

dim3 threadldx = ( threadldx.x, threadldx.y, threadldx.z ) /* unique for each 
thread in the block */ 

dim3 blockldx = ( blockldx.x, blockldx.y, blockldx.z ) /* unique for each block in 
grid */ 

dim3 blockDim = ( blockDim.x, blockDim.y, blockDim.z ) 
dim3 gridDim = ( gridDim.x, gridDim.y, gridDim.z ) 

Hence for each thread we can calculate its unique index in grid in the fol- 
lowing way: 

int threadsInBlock = blockDim.x * blockDim.y; 

int numberOfBlocklnGrid = (blockldx.y * gridDim.x) + blockldx.x; 
int numberOfThreadlnBlock = (threadldx.y * blockDim.x) + threadldx.x; 
int thid_id = (threadsInBlock * numberOfBlocklnGrid) + numberOfThreadln- 
Block; 

The host and the device have separated memory spaces called the host 
memory and the device memory, both residing in the dynamic random-access 
memory (DRAM). There are several types of device memory: global, local, 
shared, constant and texture. There are also registers. The shared memory 
and the registers are located on the chip of the GPU. The others are located 
off the chip so they have large access latency. However the sizes of the 
shared memory and the registers are much smaller than of the others. Note 
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that since the local memory is located off chip then its access latency is also 
big. 

The scopes and the lifetimes of the local memory and the registers are re- 
stricted to one thread. The scope of the shared memory and its lifetime 
is restricted to all threads of the block. The lifetime and access to other 
of memory are available for all launched threads and the host. 

For devices of compute capability 2.x the local and the global memory 
are cached. There are two types of cache: an LI cache for each multipro- 
cessor and an L2 cache shared by all multiprocessors on the GPU. The size 
of the L2 cache is fixed equal to 768 KB. The LI cache and the shared mem- 
ory are stored in the same on-chip memory and their initial sizes are 48 KB 
of shared memory and 16 KB of LI. These values can be reversed by the func- 
tion cudaFuncSetCacheConfig() (or cuFuncSetCacheConfig() for Driver API). 
In our implementations the shared memory is unuseful and its size that we 
use is not bigger than 16 KB. However, we use the first configuration because 
it gives better running times. 

The threads can be synchronized in the scope of a block by the func- 
tion syncthreads(). It sets a semaphore which causes that the execution 

of the further code waits until all parallel threads reach the specified point. 

In our implementations we use atomic functions: atomic Add () and atom- 
icSubQ, available for devices of compute capability 2.x, which perform a read- 
modify-write operations on 64-bit words residing in the global memory. 
The atomic operations are slower than their non-atomic counterparts but 
they allow us to implement the programs without any synchronization 
of the threads. 

A bandwidth is the rate at which data can be transferred. The bandwidth 
between the global memory on device and the global memory on the host 
is much smaller than the bandwidth between the global memory on de- 
vice and the memory space on GPU. Therefore it is important to minimize 
the data transfer between the device and the host. In our implementation 
we strived to reduce the copying only to necessary arrays of data. 

To allocate and deallocate memory on the device we use the functions 
cudaM alloc () and cudaFrccQ and to copy memory between the device 
and the host we use the function cudaMemcpy(). 

In our implementations we use the cutil.h library available in the GPU 
computing SDK which allows us to detect the errors returned by the device. 

3 Basic Graph Definitions 

Let G = (V, E) be a directed graph and u : E — >■ R + be an edge capacity 
function. For formal reasons, if (x, y) ^ E then we set u(x, y) = 0. Let s/t 
be two distinguished verteces in V, the source and the sink, respectively. 
Then the triple F = (V, E, u, s, t) is called a flow network. 
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A pseudoflow is a function / : V x V — > R such that, for each 
(x, y) G E, f satisfies the capacity constraints: /(x, y) < u(x, y), and the skew 
symmetry: /(x, y) = —f(y, x). We say that the pseudoflow / is a flow if for 
each node x G V — {s, t}, Y2( x y)£E Z( x ' V) = 0- We say that an edge (x, y) is 
in the flow /, if f(x, y) > 0. The value of the flow f is |/| = £\ x ^ eE f(s, x). 

In the max flow problem we are given a flow network F with a capacity 
function u and distinguished vertices s, t. We must find a flow / from s to t, 
such that | f\ is maximum. 

The residual capacity of an edge (x,y) G E is Uf(x,y) = u(x,y) — 
f(x, y). An edge, for which Uf(x, y) > 0, is called a residual edge. The set 
of all residual edges in G is denoted by Ef. The graph Gf = (Vf, Ef) is called 
a residual graph. 

It is easy to see that if Uf(x,y) > then Uf(y,x) > and it is possible 
to push more units of flow through (y,x). Let Uf(x,y) > for (x,y) G E. 
If f(x,y) > than f(y,x) < 0. This implies that Uf(y,x) = u(y,x) — 
f(y,x) > because of u(y, x) > 0. 

To present the push-relabel algorithm we need to introduce the following 
definitions. Let F be a flow network and / :Vxl^->Rbea pseud- 
oflow function. For each x G V we define e(x), the excess of the node x, 

as the sum: e(x) = Y,( z ,x)aE f( z > x ) ~ T,{ x , y )eE f( x > v)- Note if / is a flow 
then e(x) = for all x G V. If e(x) > 0, for some node x G V, then we say 
that x is an active node. The height of a node x, denoted h(x) is a natural 
number from [0, n]. 

Now we give definitions related to the matching problems and max flow 
min cost algorithms. Let G = (V = X U Y, E) be a bipartite graph, 
\X\ = \Y\ = n, \E\ = m. The bipartite matching problem is to find 
a largest cardinality subset M C E such that each vertex x belongs to at most 
one edge of M. In case \M\ = n, M is called a perfect matching. 

If : E 1 — > R is a weight function for the edges then w(M) = Y^/ x y )eM w ( x i y) 
is the weight of the matching M. The assignment problem is to find 
the largest cardinality matching MCfiof the maximum weight w(M). 

For any graph G = (V, E) the cost function is c : E — > R. Hence, 
the cost of a pseudoflow / is c(f) = T.( x ,y)&E c ( x ^y)f{ x ^v)- 

Assume a flow network F = (V, E, u, s, t) is given with an aditional edge 
cost function c. The max flow min cost problem consist in finding a max- 
imum flow with the lowest possible cost. 

Following [9] we introduce definitions used in the cost scaling algorithm, 
they extend the notation given above. 

Let I' = (V, E, u, s, t, c) be a instance of the max flow min cost problem. 
A price of a node is given by a function p : V — > R. A reduced cost 
of an edge (x,y) G E is c p (x,y) = c(x,y) + p(x) — p(y), while a part- 
reduced cost of an edge (x,y) G E is c' p (x,y) = c(x,y) —p(y). 

For a constant e > a pseudoflow / is e-optimal with respect to a price 
function p if, for every residual edge (x,y) G Ef, we have c p (x,y) > — e. 
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For a constant e > a pseudoflow / is e-optimal if it is e-optimal with respect 
to some price function p. 

We say that a residual edge (x, y) G Ef is admissible if c p (x, y) < 0. 

4 Push-Relabel Algorithm 

4.1 Sequential Algorithm 

In this section we focus on a standard sequential algorithm solving the max 
flow problem. 

Among many sequential algoritms solving the max flow problem, there are 
three basic methods. The Ford- Fulker son and the Edmonds-Karp algorithms 
are the most common and easiest. The Ford-Fulkerson algorithm calculates 
the flow in time 0{E\ f* | ) , where | /* | is value of the max flow /* . The running 
time of the Edmonds-Karp algorithm is 0(VE 2 ). In every step, these two 
algorithms look for a new augumenting path from the source to the sink. If 
found, they increase the flow on the edges of the path. Otherwise they stop 
and return the flow found up to this moment (it is optimal). 

The third solution is the push-relabel algorithm. We present its generic 
version whose time complexity is 0{V 2 E). Next we discribe two heuristics 
which significantly improve its execution time. The generic push-relabel 
version with two heuristics can be effectivly parallalized and provides a basis 
for our further considerations. 

At the beginning of algorithm h(x) = 0, for all x G V—{s}, and h(s) = \V\. 
We have also e(x) = 0, for all x G V, and e(s) = oo (Algorithm 4.1). 

Algorithm 4.1. Init operation for the push-relabel algorithm 

for each (x, y) 6 E do 

f(x,y) <- 1 

f(v,x)<-0 
for each x £ V — {s} do 

e(x) «- 

h(x) «- 1 
e(s) «— co 

m <- \v\ 

As opposed to the previous two algorithms, push-relabel does not look for au- 
gumenting paths but acts locally on the nodes (Algorithm 4.2). 

Algorithm 4.2. The push-relabel algorithm 

InitQ 

make set S empty 

S^- s 

while (S is not empty) do 
x 4— S.pop() 
discharge(x) 
if (x is active node) 
S.push(x) 
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The discharge operation for each active node x with the set S selects 
either push or relabel operation (Algorithm 4.3). 



Algorithm 4.3. discharge(x), push(x, y) and relabel(x) operations 

discharge(x): 

if (3 (x,y) 6 E f : h(x) = h(y) + l) 
push(x, y) 

else 

relabel(x) 
push(x, y): 

S <— min {uf(x, y), e(x)} 
e(x) <— e(x) — S 

e (y) «- e (y) + 5 

f(x,y) <- f(x,y) + S 

f(y,%) «- f(y,x) - s 

relabel(x): 

h(x) -f- mia{h[y\ : (x,y) €E f } + l 



The push operation is performed on an active node x, for which there ex- 
ists an outgoing residual edge (x, y) 6 Ef and the node y satisfies the height 
constraint: h(x) = h(y) + 1. If the node x is active and every edge 
(x, y) G Ef does not satisfy this constraint then the relabel operation is per- 
formed. 

It can be shown that the generic push-relabel algorithm is correct, ter- 
minates and its running time is 0(V 2 E). The proofs and a comprehensive 
discussion about the generic push-relabel algorithm and its improvments, 
can be found in [1] and [2]. 

4.2 Heuristics: Global and Gap Relabeling 

The above version of the push-relabel algorithm has poor performance 
in practical applications. To improve the running time two heuristics are 
used: global relabeling and gap relabeling [2]. To get intuition how these 
heuristics work, we make some observations. 

During the execution of the algorithm, both the excess and the height 
of the source and the sink are not changing. Then, to the end of the al- 
gorithm, the source height is \V\ and the sink height is 0. Let us define 
a distance function. 

Definition 4.1 (distance function). Let G = {V, E) be a network with a flow 
f and let Ef be the set of its residual edges. The function h : V — > N is a 
distance function if h(s) = \V\, h{t) = 0, and for each edge (x,y) G Ef, 
h(x) < %) + !. 
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It can be proved ([1]) that the height function satisfies the properties 
of a distance function at each step of the algorithm. Hence we can think 
about the node height as its distance from the sink to the source. The major 
issue from which the algorithm's performance suffers is an execution of a lot 
of unnecessary relabel operations. It can be proved ([1]) that during the ex- 
ecution, a node height can reach a limit of 2\V\ — 1. The global relabeling 
heuristic (Algorithm 4.4) prevents the heights of the nodes from growing fast 
and assigns them the smallest admissible heights. 

Algorithm 4.4 global relabeling heuristic 

make Q empty queue 
for each x £ V 

x scanned 4- false 
Q.enqueue(t) 
t. scanned <— true 
while (Q not empty) do 
x •(— Q.dequeueQ 
current <— h(x) 
current <— current + 1 
V (y,x) 6 Ef. y not scanned do 
h(y) <— current 
y scanned <— true 
Q.enqueue(y) 



The global relabeling technique consists in performing a breadth-first 
backwards search (BFS) in the residual graph and assigning the new heights, 
equal to the level number of a node in the BFS tree. Obviously BFS takes 
linear time 0{m + n). Usually global relabeling is performed once every 
n relabels. This heuristic significantly improves the performance of the push- 
relabel method. 

The second heuristic is gap relabeling. The gap relabeling "removes" 
from the residual graph the nodes that will never satisfy the height con- 
straint. This improves the performance of the push-relabel method because 
of reducing the number of active nodes. However its result is not so sig- 
nificant for the running time as the global relabeling. The gap relabeling 
heuristic also can be done in linear time. 

The gap relabeling can be performed after the BFS loop of the global 
relabeling. Any non-scanned node x is not reachable from the sink so we can 
set its height to |V|. This makes the pushed flow omit x and go to another 
node y (from which there is an augumenting path). As a result the pushed 
flow gets faster to the sink. 

We have presented the generic push-relabel algorithm with two additional 
techniques which will be used next in parallel versions. Further improve- 
ments of the sequential push- relabel algorithm can be found e.g. in [1], [2] 
and [3]. 
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4.3 Parallel Approach 

The push-relabel algorithm was parallelized by Anderson and Setubal 
in 1992 [14]. One of the first CUDA implementation was proposed by Vineet 
and Narayanan ([4]). They presented the push-relabel algorithm to graph 
cuts on the GPU, which is a tool to find the optimal MAP estimation of en- 
ergy functions defined over an MRF [11, 12, 13]. 

Vineet and Narayanan's CUDA implementations were tested on the Nvidia 
280 and 8800 GTX graphic cards (devices of compute capability at most 1.3). 
Their algorithm works on grid graphs which arise in MRFs defined over im- 
ages. The dataset and CUDA implementations are available 
from Jhttp: / /cvit.iiit.ac.in/index.php?page=resources[ 

The authors assumed that each node of a graph is handled by one thread 
and the number of outgoing edges per node is fixed, equal 4. Authors sug- 
gested that the algorithm can be implemented for the expanded 3D graphs, 
that is with 8 outgoing edges per node. Their algorithm requires a computer 
architecture that could launch the same number of threads as the number 
of the graph nodes so possibilities to run this algorithm on a CPU are small. 

The authors prepared two implementations of push-relabel algorithm: 
atomic and non-atomic. The first of them requires two phases: push and re- 
label. The second implementation, which was designed for devices of compute 
capability lower than 1.2, additionally requires a pull phase. Further we will 
describe only the first implementation, more details about the second can be 
found in [4]. 

Vineet and Narayanan have used the graph construction of Kolmogorov 
et. al. ([12]) which maintains the grid structure, suitable for the CUDA ar- 
chitecture. The data of a grid graph are stored in the global and the shared 
memory of the device. They are in 8 separated tables. Table of the heights 
is stored in the shared memory and other tables are stored in the global mem- 
ory. Among them are the excesses, the relabel masks (which say whether 
the node is active), the residual capacities of edges upwards/downwards 
nodes, the residual capacities of edges towards the nodes on the left/right 
and the residual capacities of edges toward the sink. Access to the element 
in table is by calculating its index. 

Before running the algorithm, the host thread copies data to the global 
memory on device. The main loop of algorithm is executed on CPU and for each 
phase the host thread calls another kernel. After finishing a push kernel, 
the control is returned to the host thread and can be launch the next relabel 
kernel. However, authors suggested that for some grid graphs, running m 
push phases before each relabel phase, improved the execution time. Algo- 
rithm stops when all excesses stay the same after a few iterations of loop. 

In first step of the push kernel, each node saves its height in the shared 
memory of thread-block. After the saving, each node whose relabel mask 
is set to 1 pushes the flow toward its neighbors, if they satisfy a height 
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constraint. In this step the threads read the heights saved in shared memory. 

In the relabel phase first, each thread sets a new relabel mask. If ex- 
cess of the node is positive and is connected with neighbor, which satisfied 
a height constraint, the mask is set to one (it is active). If node only has 
positive excess, the mask is set to zero (node is passive). Otherwise, mask 
is set to two (inactive node) and this node will never be active. After set- 
ting all the relabel masks, threads can calculate new heights of the nodes. 
They read the old heights saved in the shared memory and write the new 
to the global memory. It is preparation for next a push phase. 

To improve an execution time of algorithm, Authors experimented with vary- 
ing numbers of threads per block. The best result, they obtained for a thread- 
block of size 32 x 8. 

In the Vineet et al. implementation, the synchronization of threads is as- 
sured by syncthreadsQ CUDA function. This approach to the parallel 

push-relabel algorithm blocks the threads which are ready to run next oper- 
ation before finishing other threads. Therefore it causes long execution time 
of the algorithm. 

4.4 Parallel Lock-Free Algorithm 

In 2008 Hong presented a lock-free multi-threaded algorithm for the max 
flow problem ([5]) based on Goldberg's version ([2]) of the push- relabel al- 
gorithm. Implementation of Hong's algorithm requires a multi-threaded ar- 
chitecture that supports read-modify-write atomic operations. In our imple- 
mentation we have used a Nvidia CUDA atomicAdd(int*, int, int) and atom- 
icSub(int*, int, int) functions. 

Without lost of generality we assume that the number of running threads 
is \V\ and each of them handles exactly one node of the graph, including all 
push and relabel operations on it. In several, a few nodes can be handled 
by one thread. 

Let x be the running thread representing the node x 6 V. In Hong's 
algorithm (Algorithm 4.5) each of the running threads has the following pri- 
vate atributes. The variable e' stores the excess of the node x. The variable 
b! stores the height of the currently considered neighbour y of x such that 
(x, y) G Ef. The variable h stores the height of the lowest neighbour y of x. 

Other variables are shared between all the running threads. Among them 
there are the arrays with excesses and heights of nodes, and residual capac- 
ities of edges. 

First, the Init operation is performed by the master thread, in CUDA 
programing it is the host thread. This init code is the same as its counter- 
part in the sequential push-relabel version. Next, the master thread starts 
the threads executing in parallel the lock-free push-relabel algorithm (in 
CUDA the host thread launches kernels). 
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The basic changes, introduced by Hong, deal with the selection of opera- 
tion (push or relabel) that should be executed by x, and to which of the adja- 
cent nodes y, (x, y) G Ef, a flow must be pushed. In opposite to the push op- 
eration of the generic sequence version where any node y connected by a resid- 
ual edge to x such that h(x) = h(y) + 1 could be pushed, it selects the lowest 
node among all the nodes connected by a residual edges (lines 4-9). Next 
if the height of y is less than the height of x (line 10), the push operation is 
performed (lines 11-15). Otherwise, the relabel operation is performed, that 
is, the height of x is modified to h(y) + 1 (line 17). Note that the relabel 
operation need not be atomic because only the x thread can change the value 
of the height of x. Furthermore, all critical lines in the code where more than 
two threads execute the write instruction are atomic. Hence it is easy to see 
that the algorithm is correct in respect to read and write instructions. 

Algorithm 4.5. Lock-free multi-threaded push-relabel algorithm by Bo Hong 

InitQ: 

h(s) «- \V\ 

for each x 6 V — s 

h(x) <- 
for each (x,y) 6 E 

Uf(x,y) <- u xy 

uf{y,x) «- u yx 

for each (s, x) 6 E 
Uf(s, x) <— 
Uf(x, s) -f— u xa + u sx 
e{x) «- u sx 

lock-free push-relabel(): 

1. /*x - node operated by the x thread*/ 

2. while (e(x) > 0) do 



3. el «- e(x) 

4. y «- NULL 

5. for each (x,y) S Ef do 

6. h' <r- h(y) 

7. if h > h' do 

8. h ti 

9. _y <- y 

10. if h(x) > h do /*the x thread performs PUSH towards y*/ 

11. 5 4—min{e', Cf(x,y)} 

12. c f (x,y) <r- Cf(x,y) - 5 

13. c s ({y),x) i-Cf({y),x) + 6 

14. e(x) <— e(x) — 5 

15. e((y))^e((y)) + 5 

16. else do /*the x thread performs RELABEL*/ 

17. h(x)<-h + l 



All running threads have access to a critical variable in the global mem- 
ory, but indeed their task are peformed sequentialy thanks to the atomic 
access to the data. The order of the operations in this sequence cannot be 
predicted. Despite this the algorithm can be proved correct. 
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It can be proved that the lock-free algorithm terminates after at most 
0(V 2 E) the push/relabel operations. Because it is executed in parallel 
by many threads, the complexity of algorithm is analyzed in the number 
of the operiations, not in the execution time. 

4.5 Lock- Free Algorithm and Heuristics 

In 2010 Hong and He ([6]) improved the lock-free push-relabel algorithm, 
by adding to it a sequential global relabel heuristic performed on CPU (Algo- 
rithm 4.6). According to the authors and their experimental result, the new 
CPU-GPU-Hybird scheme of the lock-free push-relabel algorithm is robust 
and efficient. 

Algorithm 4.6. CPU-GPU-Hybird of push-relabel algorithm by Hong and He 

InitQ: 

1. initialize e, h, Uf and ExcessTotal 

2. copy e and u j from the CPU main memory to the CUDA global memory 
push-relabel- cpuQ : 

1. while (e(s) + e(t) < ExcessTotal) do 

2. copy h from the CPU main memory to the CUDA global memory 

3. call push-relabel-kernel() 

4. copy Uf,h and e from CUDA global memory to CPU main memory 

5. call global-relabel-cpu() 



Following the Authors we will talk about the CPU-GPU-Hybrid algo- 
rithm in the context of CUDA programming. Similarly as in the previous 
algorithm we assume that each node is operated by at most one thread. 
The previous version of the lock-free push-relabel algorithm will be called 
the generic algorithm, while the CPU-GPU-Hybrid scheme will be named 
the hybrid algorithm. 

The initialization of the hybrid algorithm is the same as its counterpart 
in the generic algorithm. The hybrid algorithm maintains 3 arrays with 
excesses, heights and residual capacities of the nodes and the edges and 
keeps the global variable ExcessTotal, equal to the value of the flow pushed 
from the source. ExcessTotal resides in the global memory on the host and 
can be changed during the global relabeling. 

In opposite to the generic algorithm, the main body of the hybrid algo- 
rithm is controlled by the host thread on CPU. The host thread executes 
the while loop until the cumulative value of the excesses stored in the source 
and the sink achieves the value of ExcessTotal. In this moment all the valid 
flow gets to the sink, and the rest of a flow returns to the source. Then, 
the excess at the sink equals the value of the maximum flow. 

In the first step of the while loop, the host thread copies the heights 
of the nodes to device and launches the push-relabel-kernel. When the control 
is returned back to the host thread, the calculated pseudoflow and the heights 
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of the nodes are copied to the CPU memory and the global-relabel-cup is per- 
formed. 



Algorithm 4.7. Initialization for CPU-GPU-Hybrid 



InitQ: 

1. h(«)«-|V| 

2. e(s) i- 

3. for each x £ V — s 

4. ft(x) <- 

5. e(x) «- 

6. for each (x,y) S E 

7. Uf(x,y)<-u xy 

8. Uf(y,x) <- u yx 

9. for each (s, x) & E 

10. u/(s,x) «— 

11. U/(x, s) -f~ Mis + U SI 

12. e(x) «- u sx 

13. ExcessTotal <— ExcessTotal + u. 



Algorithm 4.8. Lock-free push-relabel and global relabel for CPU-GPU-Hybrid 



lock-free push-relabel(): 

1. /*x - node operated by the x thread*/ 

2. while {CYCLE > 0) do 

3. if (e(x) > and h(x) < \V\) do 

4. e' «- e(x) 

5. y <- NULL 

6. for each (x,y) £ do 

7. A' <- h(y) 

8. if ft > ft' do 

9. ft <- ft' 

10. __y*-y 

11. if ft(x) > ft do /*then the x thread perform PUSH towards y*/ 

12. <5 <— min{e', cy(x,j/)} 

13. c f (x,y) «- c f (x,y) - 8 

14. c/((j/),x) «- c f ((y),x) +8 

15. e(x) <— e(x) — (5 

16. e((j/)) <- e((j/)) + 5 

17. else do /*then the x thread perform RELABEL*/ 

18. ft(x) <- ft + 1 

19. CYCLE 4- CYCLE - 1 

global relabeling heuristicQ: 

1. for all (x, t/) £ B do 

2. if (ft(x) > h(y) + 1) then 

3. e(x) 4— e(x) — Uf(x,y) 

4. e(j/) <- e(2/) + Uf(x,y) 

5. Uf(y,x) 4- Uf(y,x) +Uf(x,y) 

6. Uf(x,y)*-0 

7. do a backwards BFS from the sink and assign the height function 

8. with each node's BFS tree level 

9. if (not all the nodes are relabeled) then 

10. Mx e V do 

11. if (x is not relabeled and marked) then 

12. mark x 

13. ExcessTotal <— ExcessTotal — e(x) 
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The push-relabel-kernel algorithm differs from the generic algorithm 
in the timing of executing the kernel. The thread stops the while loop 
after CYCLE iterations (where CYCLE is an integer constant defined 
by the user) and not when its node becomes inactive. After stopping the loop 
the heuristic is called, and then the loop is initialized again. 

Since the while loop can terminate at any moment (randomly in respect 
to the original sequential flow computation) it may occur that the property 
of some residual edge (x,y) G Ef is violated, i.e. h(x) > h(y) + 1. Then, 
before computing the new heights of nodes, all the violating edges must 
be canceled by pushing the flow. It is made in lines 1-6 of the global rela- 
beling heurisitic. Next, the nodes are assigned new heights by performing 
in the residual graph a backwards BFS from the sink towards the source. 
The new heights are equal to the shortest distances towards the sink in the 
residual graph. The excesses of nodes, which are not availiable from the sink 
in backwards BFS tree, must be substracted from ExcessTotal, because it 
is a stored excess which will never reach the sink. 

In 2011 Hong and He improved both heuristics to a new Asynchronous 
Global Relabeling (ARG) method ([7]). So far, the global and gap relabel- 
ing heuristics were run independent from the lock-free algorithm (in CUDA 
implementation, to run the heuristics the control was returned to the CPU). 
The main reason for this was that the push and the relabel operations are 
mutually exclusive with the global and gap relabeling heuristcs (a critical 
moment is when both the heuristics and the relabel operation want to set 
a new height for the same vertex). However, this problem does not oc- 
cur for the non-lock-free versions of the push-relabel algorithms (which is 
contrary to our expectations). In the new approach, the ARG heuristic is 
executed by a distinguished thread which corresponds to any vertex and runs 
periodicaly, while the other threads asychronously run push or relabel oper- 
ations. It significantly improves the execution time of the algorithm. The 
only problem is that ARG needs to maintain a queue of the unvisited ver- 
tices whose size is 0(V). In CUDA programming, a queue of that size can be 
maintained only in the global memory the access to which is very slow. Per- 
haps this is why the implementation presented in [7] uses C and the pthread 
library for multi-threaded constructions. 

4.6 Our Implementation 

In our implementation of the lock-free push-relabel algorithm we have 
tried to use the ARG heuristic but from the reasons mentioned above this al- 
gorithm turns out to be slower than the algorithm using a heuristic launched 
on CPU. Therefore we use the approach presented in Algorithm 4.8 with 



14 



an additional improvment. In the end of the global relabeling phase we add 
the gap relabeling heuristic which for each unvisited node in the BFS tree 
sets its height to \V\. 

We implement the procedure of the lock-free push-relabel algorithm 
as a CUDA kernel. It is executed by \ V\ +2 threads. After CYCLE iterations 
of the while loop in the kernel, the control is returned to the Host thread. 
The constant CYCLE is set to 7000 by a preprocessor macro (in our tests 
this value yielded best results). Next the host thread calls the C procedure 
running on CPU which performs a global relabeling. The algorithm runs 
until all the flow of the value ExcessTotal gets to the sink, then the excess 
of the sink is equal ExcessTotal. Note that during the execution of the heuris- 
tic, the value of ExcessTotal can be decreased. 

In our implementation a vertex is a structure node holding four point- 
ers: excess, height, toS our ceOr Sink and firstOutgoing. The pointer toSorce- 
OrSink points to the edge leading directly towards the source or the sink. 
This makes the access to the source and the sink faster. The pointer firstOut- 
going points to the first edge on the adjacency list of the vertex. The struc- 
ture of an edge (named adj) also holds four attributes. The pointer vertex 
points to the neighbor to which the edge leads. The pointer flow points 
to the location in the global memory where the residual capacity of the 
edge is stored. The attribute mate is a pointer to the backward edge in the 
residual graph and next is a pointer to the next edge on the adjacency list. 

During the push operation peforming on the edge (x, y) both the residual 
capacities of edges (x, y) and (y, x) are changed. Then to improve the cache 
utilization, the residual capacities of edges (x, y) and (y, x) are stored one 
after another. 

Throughout the algorithm only flows, excesses and heights need to be 
copied between the device and host. Therefore to minimize the data trans- 
fer between the host and the device we separate arrays containing these 
data from the structures. Since global relabel heuristic gets all these arrays 
at the beginning than they are copied onto the host. On the other hand after 
performing the heuristic only the heights need to be copied back to the de- 
vice. Therefore we store the excesses and the residual capacities in a single 
array and the hights are stored in a separate array. 

Before starting the algorithm the two arrays are allocated on the device 
containg nodes and edges respectively. The first keeps the structures of type 
node and the second contains the structures of type adj. During the execution 
the algorithm arrays of nodes and edges are not copied to the host. 

5 Cost Scaling Algorithm 

It is known that the non-weighted matching problem can be easily re- 
duced to the max flow problem (for more details see [1, paragraph "Maximum 
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bipartite matching"]). In [9] Goldberg and Kennedy present a way how to ef- 
ficiently solve the weighted matching problem with a cost scaling algorithm. 
In their work they reduce the assignment problem to the transforation prob- 
lem and present their implementation of the algorithm. 

We present an analogous reduction from the assignment problem to the max 
flow min cost problem. In [9], for a given graph G' = (V = X' U Y',E'), 
they additionaly define a supply d(x), x £ V that Vx G X, d(x) = 1 
and Vy G Y, d{y) = — 1. It stimulates preflow pushed from the source to ev- 
ery node of X and preflow pushed from every node of Y to the sink. Hence 
the push relabel algorithm starts execution with e(x) = d{x). In our work, 
instead of defining the supply and the transporation problem, we initialize 
the push-relabel algorithm with e{x) = 1, x G X and e(x) = —1, y G Y. 

Let / be an instance of the assignment problem: / = (G,w), G = 
(V = X U Y, E) where \X\ = \Y\ and w is a weight function for edges. 
We construct an instance I' = (G', u, c) of the max flow min cost problem 
as follows. For each edge (x, y) G E we add (x, y) and (y, x) to E' . For each 
(x,y) G X x Y define capacities: u(x, y) = 1 and u(y, x) = 0, and costs: 
c(x,y) = w(x,y) and c(y,x) = —w(x,y). The graph G' is still bipartite. 



Cost scaling algorithm 



P R 



I' 



Refine 

e = E/a 



Push -relabel 
lock-free 



e<l/n? YES 



NO 



Figure 2: Cost scaling algorithm. 
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5.1 Sequential Cost Scaling Algorithm 



The two cost scaling algorithms with efficient implementations which are 
described by Goldberg et al. in [8,9], are slightly different. The first of 
them [8] is the generic cost scaling algorithm and was proposed by Goldberg 
in [10]. The second of them [9] was applied to the assignment problem. 

Now we present the first version (Algorithm 5.0), next we point out 
differences from the second which requires some changes in the definitions 
of the e-optimal pseudoflow and the admissible edge. Our version of the 
algorithm uses the unmodified definitions and will be presented in subsec- 
tion 5.3. 

Let C be the largest cost of an edge in G. 

Algorithm 5.0. The cost scalling algorithm 

1. Min-Cost(): 

2. e «- C 

3. for each x £ V do 

4. p(x) «— 

4. while (e > 1/rt) do 

5. (e,f,p) <- Refine(e,p) 

1. Refine(e,p): 

2. e i- e/a 

3. for each (x, y) 6 E : c p (x, y) < do /*E contains edges of G, not Gf*/ 

4. f(x,y)<-l 

5. while (f is not a flow) do 

6. apply a push or a relabel operation 

7. return (e, /, p) 

1. push(x, y): 

2. S -s— min {e(x), Uf(x, y)} 

2. send 8 units flow from x to y 

1. relabel(x): 

2. p(x) <- maa: (:CiZ)6B/ {p(z) - c(x,z) - e} 



The main procedure of the algorithm is a Min-Cost method which main- 
tains a variable e, a flow /, and a price function p. These variables are 
changed in the while loop by the procedure Refine. On the beginning of each 
step of the loop the flow / is e-optimal with respect to p. The while loop 
stops when e < 1/n. It was proved that if Refine reduces the parameter e 
by a constant factor, the total number of iterations is 0(m log n) (paper [10], 
Theorem 4.5). 

The procedure Refine starts with decreasing e to e/a and saturating ev- 
ery admissible edge (x, y), ie. c p (x, y) < 0. This spoils the initial flow / such 
that / becomes an e-optimal pseudoflow, for e = (because V(x, y) 6 E, 
c p (x, y) > 0). This makes also some nodes active and some with a negative 
excess. Next the pseudoflow / become an e-optimal flow by making a se- 
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ries of the flow and the price update operations, each of which preserves 
e-optimality. There are two kind of the update operations: push and relabel. 

The push(x, y) operation is applied to an active node x and a residual 
edge (x, y) that is admissible: c p (x,y) < 0. Then 5 units of the flow is put 
to the node y: decreasing e(x) and f(x, y) by 5 and increasing e(y) and 
f(y,x) by 5. 

The relabel(x) operation is applied to an active node x such that (x, y) G Ef 
and (x, y) does not satisfy the admissible constraints, i.e c p (x, y) > 0. The new 
value of p(x) is the smallest value allowed by the e-optimality constraints, 
ie. max( XtZ)GEf {p(z) - c(x,z) - e}. 

Goldberg proved that the Refine procedure is correct, that is, if it termi- 
nates, the pseudoflow / is an e-optimal flow. Hence the min-cost algorithm 
is also correct (see [10]). 

The procedure Refine maintains a set S containing all the active nodes. 
The loop terminates when S becomes empty. The generic implementations 
of the procedure Refine runs in 0(n 2 m) time, giving 

an 0(n 2 ramin{log(nC),mlogn}) time bound for computing a minimum- 
cost flow. 

The differences between the algorithms mentioned above occur in a new 
definition of an admissible edge, in the initialization stage of the refine pro- 
cedure, and in the relabel operation. 

Let e > 0. A residual edge (x, y) G Ef is admissible, if (x, y) G (X x7)fl Ef 
and c p (x,y) < \e or (x, y) G (Y x X) HEf and c p (x,y) < —\e. Then we 
have different conditions for the two types of edges. 

The e-optimality notation is closely related to the admissible edge defi- 
nition. Therefore after changing it we must later update the former, as well. 
For a constant e > a pseudoflow / is e-optimal with respect to a price func- 
tion p if, for every residual edge (x, y) G (X x7)fl Ef, we have c p (x, y) > 0, 
and for every residual edge (y,x) G (Yxl)n Ef, we have c p {y,x) > — e. 
For a constant e > a pseudoflow / is e-optimal if it is e-optimal with respect 
to some price function p. 
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Algorithm 5.1. The cost scaling algorithm, version 2 



1. Refine(e, p): 

2. e <- e/a 

3. for each (x,y) g E 

4. f(x, y) «— /*it can make some node active*/ 

5. for each x a X 

6. p(x) •( min (x>y)eE c' p (x,y) 

7. while (f is not a flow) do 

8. apply a push or a relabel operation 

9. return (e, /, p) 

1. relabel(x); 

2. if x £ X then 

3. p(x) «- max {x y)eEf {p(y) - c(x, y)} 

4. else if x S V then 

5. p(x) «- maX( ZtX)eEf {p(z) + c(z, x) - e} 



On the start of the procedure refine, e and an e-optimal flow / are 
given. Similary as in the first algorithm, e is decreased to e/a. Next, 
for each saturated edge (x,y), the flow is removed from (x,y) back to its 
node x. This makes some nodes active and some nodes obtain negative ex- 
cess: e(x) < 0. Before the loop starts, for each node x G X, p(x) is set 
to —min^ x y ^ eE c' p (x,y). Then, we get an e-optimal pseudoflow /, for e = 
(because of V(x, y) G Ef n (X x Y) c p (x, y) > and V(y, i) G £/ fl (7 x X) 
c p (y,x) > — e). Further, the push and the relabel operations are performed 
to make / an e-optimal flow. 

It is easy to see that the differences between the algorithms do not result 
in any change in the returned output but they have impact on the efficiency. 
Therefore our idea is a combination these two codes (Algorithm 5.2.). We 
assume the first version of the definitions of admissible edges and e-optimal 
pseudoflows. 

Algorithm 5.2. The cost scalling algorithm, our approach 

1. Refine(e,p): 

2. e <- t/a 

3. for each (x,y) £ E 

4. f(x, y) 4— /*it can make some node active*/ 

5. for each i6l 

6. p(x) i min/ Xt y\£EW p (x, y ) + e} /*it makes pseudoflow / e-optimal, (also 0-optimal)*/ 

7. while (f is not a flow) do 

8. apply a push or a relabel operation 

9. return (e, /, p) 

1. relabel(x): 

2. p(x) i min (XiZ)eEf {c' p (x,z) + e} 
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The initialization of the procedure refine (lines 2-6) is the same as in Al- 
gorithm 5.1 except line 6, where we must adapt it to the appropriate as defi- 
nition the admissible edge. The relabel operation is the same as in Algorithm 
5.0. 

5.2 Heuristics: Price Updates and Arc Fixing 

The papers [8] and [9] provide a lot of inprovements to the cost scaling 
algorithm for its sequential version. We focus our attention on the price 
updates and arc fixing heuristics. The idea of the price updates heuristic 
(introduced in [10] and described also in [15]) is similar to Dijkstra's short- 
est path algorithm, implemented using buckets as in Dial's implementation 
[16] of the shortest path algorithm. 

Algorithm 5.3. Price updates heuristic 



price-updates-heuritic ( ) : 

make empty sets B[-\ /* buckets of unscanned nodes*/ 
make empty set S /* set of active nodes */ 
make array I /* constisted the labels of nodes*/ 
for each iGV 

x. scanned «— false 

l(x) <— CO 

if e(x) < then 

_B[0].push(:r) /* bucket is consisted of the x which excess is negative*/ 
x. bucket «— 

else 

x. bucket «— oo 
if (e(x) > 0) then 

5.push(x) /* active node x is pushed to S */ 

% <- 

while (S is not empty) do 

while (B[i] is not empty) do 
x «— _B[i].pop() 
for each (y, x) e Ef do 

if (j/.scanned = false and [c p (y,x)/e\ + 1 < y. bucket) do 
old <— y. bucket 
new [c p (y, x)/e\ + 1 
y. bucket <— new 
B[old].pop(y) 
B[new].pnsh(y) 
x. scanned <— true 
l(x) «- i 

if (e(x) > 0) then 

S.pop(x) 
last «— i 
i <- i + 1 
for each i£V 
if l(x) < co 

p(x) «— p(x) — e l(x) 

else 

p(x) «— p(x) — e (last + 1) 



In our implementation, the price update heuristic maintains the set of buck- 
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ets B[-]. Each node belongs to at most one bucket. Any node with a neg- 
ative excess is in B[0] and initially buckets with the number larger than 
are empty. Therefore, each node with a non-negative excess belongs to num- 
ber oo. In general, for any node the number of its bucket equals its distance 
from any node residing in the 0-bucket. Additionaly, the heuristic mainatains 
an array I containing of the labels, which are distances to some node with neg- 
ative excess i.e. a node residing in the 0-bucket. The distances are multiplies 
of e. 

In each iteration of algorithm, a node with the lowest distance (ie. a node 
residing in a nonempty bucket with the lowest number) is scanned. Af- 
ter the scanning, the distance of that node is set to the number of its bucket 
and the node is marked as scanned. 

During the scanning, residual edges entering the scanned node are tested. 
If a neighbor was not scanned yet and its new calculated distance is smaller 
than the present, then its distance is updated to a new and this neighbor 
is removed from the old bucket and put into to a new bucket. 

In our implementation we also use the arc heuristic described in [8]. 
It deletes some edges from a residual graph thus decreasing the running 
time. For the e-optimal flow / and edge e, if c p (e) > 2ne then the flow 
of e will never be changed. Therefore this edge can be permanently omitted. 

5.3 Applying Lock-Free Push-Relabel Method to Cost Scal- 
ing Algorithm 

To get better running time of the Refine procedure than in a sequential 
algorithm, we improve it by applying the lock-free push-relabel algorithm 
(Algorithm 5.4). 



21 



Algorithm 5.4. Push-relabel, our approach 



refine(G, e,p): 

1. /* x - node operated by the x thread */ 

2. while (e(x) > 0) do 



3. e' <- e(x) 

4. j> «- NULL 

5. min_c' p «— oo 

6. for each (a:,i/) g i?y do 

7. tmp_c' p <- c' p (x,y) 

8. if min^c'p > tmp_c' p do 

9. min_c' p «— tmp_c' p 

10. V ^ y 

11. if (min_Cp < — p(rr)) do /* that is: c p (x,y) < - edge is admissible */ 

12. /* then the x thread performs PUSH towards ?/ */ 

13. u f (x,y) 4- u f (x,y) - 1 

14. u f (y,x) <^Uf(y,x) + l 

15. e(a;) <— e(ai) — 1 

16. e(jr) 4- e(y) + 1 

17. else do / *then the x thread performs RELABEL */ 

18. p(x) i (min_c' p + e) 



As in Hong's algorithm, we assume that one node can be operated by at most 
one thread. Each thread has the following private atributes. The variable e 1 
stores the excess of the node x. The variable y stores the node with the low- 
est reduced cost. The variable min_c' p stores the lowest partially reduced 
cost of the edge (x,y). The variable tmp_c' p stores the temporarily reduced 
cost of the edge (x,y). The arrays with excesses e and prices p of nodes, and 
costs c and residual capacity Uf of edges are shared beetwen all the running 
threads. 

Let x be an active node operated by thread x. Before performing push 
or relabel operation, x select the residual edge (x,y) G Ef, whose reduced 
cost is the lowest among all residual edges outgoing from x (lines 6-10). 
Let (x,y) be the edge with the reduced cost min_c' p . The thread x verifies 
whether (x, y) is admissible (line 11). If so, x puts the unit of flow towards y: 
decreasing excess of x and residual capacity of (x,y), and increasing excess 
of y and residual capacity of (y,x) (lines 13-16). Otherwise, if (x,y) is not 
admissible, the relabel operation is performed on x and new price of x is 
p(x) i (min_c' p + e) (line 18). 

Obviosly, the decreasing and increasing variables in the push operation 
must be atomic because of the write conficts, which may occur when two 
threads will want to change excess of the same node. 

5.4 Correctness of Algorithm 

The correctness of the algorithm follows from the correctness of the Refine 
procedure. 

The main difference between our algorithm and the push-relabel algo- 
rithm for the max flow problem is that in the second method for each node, 
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each of its neighbors sees the same height of that node. In our algorithm 
for each node, each of its neighbors sees the same price of that node but to de- 
termine the e-optimality edge, the reduced cost of an edge (not the price of 
the node) must be considered, which can be different for each neighbors. 

Our proof is based on Hong's proof of the correctness of the lock-free 
push-relabel algorithm [5] and Goldberg and Tarjan's proof of the correctness 
of the generic refine subroutine [2]. 

Lemma 5.1. During the execution of the refine operation, for any active 
node one of two operations: push or relabel can be applied. 

Proof Let x be an active node. Then there must be an edge (x, y) 6 Ef 
because a push operation has occured which increased the value of e{x) 
and Uf(x,y). Let c p (x,y) >= 0. Hence (x,y) is not admissible and a push 
operation is impossible. Therefore a relabel operation can be applied. □ 

Lemma 5.2. During the execution of the refine operation the price of a node 
never increases. 

Proof. Let x be an active node. If there is a residual edge (x, y) 

such that c p (x,y) < then the push operation can be applied to it 

and the price of x does not increase. 

Otherwise, if for each residual edge (x,y), c p (x,y) > = then the relabel 
operation can be applied to x. Let (x, y) G Ef have the smallest reduced 
cost among all the residual edges outgoing from x. Since c p (x, y) > = 
then c(x,y) + p(x) — p(y) >= and p(x) >= p(y) — c(x,y) = —c' p (x,y). 
Additionaly, after the relabel operation we have p(x) = — (c' p (x, y) + e). 
Therefore the price of x does not increase. □ 

Now, following Hong's proof, we define a trace, a preparation stage 
and a fulfillment stage and two basic types of computation (history) traces 
consisting of push and/or relabel operations. 

The trace of the interleaved execution of multiple threads is the order 
in which instructions from the threads are executed in real time. 

Each operation (both push or relabel), can be split into 2 stages: the prepa- 
ration and the fulfillment. 

For the push operation the preparation stage is performed in lines 6-11 
of Algorithm 5.4 and the fulfillment stage is lines 12-16. For the relabel 
operation the preparation stage is in lines 6-10 of Algorithm 5.4 and the ful- 
fillment stage is in line 18. 

The preparation stage tests if the operation is aplicable. The fulfillment 
stage finishes the operation. The P and F notations denote the preparation 
and the fulfillment stage respectively. 

A stage-clean trace consists of non overlapping operations, for example: 
(P(pushl), F(pushl), P(relabell), F(relabell), P(push2), F(push2)). 
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In stage- stepping trace all the preparation stages are performed before any ful- 
fillment stage, for example: (P(pushl), P(relabel2), P(relabell), F(relabel2), 
F(pushl), F(relabell)). 

Lemma 5.3. Each trace consisting of push and/or relabel operations is se- 
mantically equivalent]) to one of the two basic traces: stage-clean or stage- 
stepping. 

Proof There are 5 nontrivial cases in which the stages of push and relabel 
operations on common data can interleave: 

1. push(x, y) and push(y, z) 

The fulfillment stage of push(x,y) increases e(y). However the prepa- 
ration stage of push(y,z) reads e(y). The reading and the writing 
operations can occur in the following three scenarios. 

(a) P(push(x,y)) - F(push(x,y)) - P(push(y, z)). 

This is a stage-clean trace: push(x, y) -> push(y, z). In the fulfill- 
ment of push(x, y) the thread x increases e(y) before the thread 
y reads e(y) in the preparation stage. 

(b) P(push(y, z)) - F(push(y, z)) - F(push(x,y)). 

This is a stage-clean trace: push(y, z) -> push(x, y). In the ful- 
fillment of push(y, z) thread y decreases e(y) before the thread x 
writes to the e(y) in fulfillment stage. 

(c) P(push(y, z)) - F(push(x,y)) - F(push(y, z)). 

This is also a stage-clean trace: push(y, z) -> push(x, y). In the ful- 
fillment of push(x, y), the thread x increases e(y), which was ear- 
lier remembered by thread y. Next, thread y pushes the flow, 
whose value is depended from the old value of e(y), to the node 
z. So it is equvalent that the thread y pushes the flow to z 
and then the push operation from x to y is performed. 

2. push(x, y) and push(z, y) 

Both push(x, y) and push(z, y) increase the value of e(y) without read- 
ing and storing this value before. It may occur in two different scenar- 
ios, each of them is equivalent to a stage-clean trace. 

3. push(x, y) and relabel(y) 

In the preparation of push(x, y) the thread x chooses the edge (x, y) 
because it has the lowest partially reduced cost c' p (x,y). However, 
in the fulfillment stage the thread y updates the price p(y). We have six 
different scenarios interleaving the stages of push(x,y) and relabel(y). 
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(a) P(push(x,y)) - F{push{x,y)) - P{relabel{y)) - F{relabel(y)) 
Obvious. It is equivalent to the stage-clean trace: push(x, y) -> 
relabel{y). 

(b) P{relabel{y)) - F{relabel{y)) - P(push(x,y)) - F(push(x,y)) 
Obvious. It is equivalent to the stage-clean trace: relabel{y) -> 
push(x, y). 

(c) P(push(x,y)) - P(relabel(y)) - F(push(x,y)) - F(relabel(y)) 
P(push(x,y)) - P(relabel(y)) - F(relabel{y)) - F(push(x,y)) 
P(relabel(y)) - P(push(x,y)) - F(relabel(y)) - F(push(x,y)) 
P(relabel(y)) - P(push(x,y)) - F(push(x,y)) - F(relabel(y)) 

In this cases two preparation stages are performed before any 
fulfillment stage one. Therefore they are equivalent to the stage- 
stepping trace: P(push(x,y) -> P{relabel(y)) -> F(push(x,y)) 
-> F(relabel(y)). 

4. push(x, y) and relabel(z) 

Let p{z) be the new price of z after the fulfillment stage of relabel(z). 

According to lemma 4.2,: p(z) < p(z) must hold. 

We have c p (x, y) = min^)^ c p (x, w) < c p (x, z). 

Then c p (x, z) = c(x, z) +p(x) — p(z) < c(x, z) +p{x) —p(z) = c p (x, z). 

Since c p (x,z) > c p (x,z) > c p (x,y) > 0, then the operation relabel(z) 

does not impact the operation push(x, y) and this scenario is equivalent 

to a stage-clean trace. 

5. relabel(x) and relabel(y) 

In the fulfillment stage of relabel(x) the thread x updates p(x) and (y, x) 
may be the residual edge read by the thread y before or after the ful- 
filment stage of relabel(y). Then there are six scenarios interleav- 
ing the stages of relabel(x) and relabel{y). The proof is analogous 
to the case 3. 

□ 

Lemma 5.4. For any trace consisting of three or more push and/or relabel 
operations there exists an equivalent sequence consisting only of stage-clean 
or stage- stepping traces. 

Proof. Similar as above. □ 
Lemma 5.5. When the algorithm terminates f is an e- optimal flow. 

Proof. We show that during the execution of the algorithm any residual 
edge (x,y) satisfies the constraint c p (x,y) > —e with one exception which is 
transient. Let / be an e-optimal pseudoflow. There may occur the following 
situations: 
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1. applying the push operation 
Obviously. 

2. applying the relabel operation 

Let x be the relabeled node and p(x) be the new price of x. According 
to Lemma 5.2, we have p(x) < p(x). Then any edge of the form: 
(x, y) or (y, x) can spoil the e-optimality of /. For any residual edge 
(x,y) outgoing from x, we have c p (x,y) = c(x,y) + p(x) — p(y) < 
c(x,y) +p(x) - p(y) = £p(x,y) Therefore, if c p (x,y) > c p (x,y) > -e 
then the relabel operation preserves the e-optymality. 

3. applying relabel(x) and relabel(y) operations 

According to Lemma 5.3 this scenario can be reduced to a stage-clean 
trace or stage-stepping trace. A clean-stage trace can be reduced 
to cases: 1. and 2. above. A stage-stepping trace has the following 
cases. 

(a) (x,y) G E f and (y,x) G Ef 

In this case, c p (x, z) >= 0, for all (x, z) G Ef, and c p (y, w) >= 0, 
for all (y,w) G Ef. Therefore, since 

c P (x,y) = c(x,y) +p{x) - p{y) = -(c(y,x) + p(y) - p{x)) = -c p (y,x) 
then Cp(x,y) = -c p (y,x) = 0. 

Furthermore, c p (x,y) = mm{c p (x, z), (x,z) G Ef} and 
c P (y,x) = mm {c p (y,w), (y,w) G Ef}. 

Therefore, the fulfillment stages of relabel(x) and relabel(y) up- 
date the prices of x and y respectively: 
p{x) < (Cp(x,y) + e) and p(y) i {c' p {y,x) + e). 

Let us check whether the new prices preserve e-optimality of the resid- 
ual edges (x,y) and (y,x). Since 

c p (x,y) = c(x,y)+p(x)-p(y) = c(x, y) - d p (x, y) - e + d p (y, x) + 
e = c(x, y) - c(x, y) + p{y) + c(y, x) - p(x) = c(y, x) + p(y) - 
p(x) = c p (y,x) = >= — e, then (x,y) preserves e - optimality 
of the pseudoflow /. Because of c p (x,y) = —c p (y,x) then also 
c p (y, x) = and (y, x) preserves e-optimality of /. 

(b) (x,y) G E f and (y, x) g E f 

We want to see if the edge (x, y) preserves e-optimality of the pseud- 
oflow / after the fulfillment stages of relabel(x) and relabel(y). 
Let (x, z) be the edge that affects on the new price p(x) of x. 
Since c p (x,z) < c p (x,y) then c' p (x, z) < c' p (x,y) and 
p(x) = -(d p (x, z) + e)> -(d p (x, y) + e) = -c(x, y) + p(y) - e > 
-c(x,y) +p(y) - e. 

Therefore c p (x,y) = c(x,y) +p(x) — p(y) > — e. Q.E.D 
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(c) (x,y) i E f and (y,x) G E f 
The proof is the same as above. 

(d) (x,y) i E f and (y,x) <£ E f 

If (x, y) and (y, x) are not residual then after the relabel opera- 
tions they will not be either. Therefore, this case is trivial. 

4. applying push(x, y) and push(z, y) operations 

Following Lemma 5.3, this case is equivalent a stage-clean trace hence 
it reduces to the cases: 1 and 2. 

5. applying push(x, y) and relabel(y) operations 

This case is equivalent to either a stage-clean or a stage-stepping trace. 
The first scenario can be reduced to the cases 1 and 2, while the second 
has the following cases. Note (x, y) G Ef because the push operation 
is applicable. 

(a) (y, x) G Ef 

Since the push operation preserves e-optimality of the pseud- 
oflow /, then from the inductive assumption, the pseudoflow / 
is e-optimal after the push operation. 

The fulfillment stage of push(x, y) does not affect the value of c p (y, x). If 
< c p (y, x) < min {c p (y, z), (y, z) G Ef} before the fulfillment stage 
of relabel(y) then the new price of the node y satisfies p(y) = 
— (mm.(y tZ ) € Ef c 'p(y> z ) + e ) > —( c ' P {y^ x ) + e )- Therefore, after the 
fulfillment stage of relabel(y) we obtain: c p (y, x) = c(y, x)+p(y) — 
p(x) > c(y,x)-c' p (y,x)-e-p(x) = c(y, x) - c(y, x) +p(x) - e - 
p{x) = —e. Then the residual edge (y, x) is e-optimal with respect 
to the new price of the node y. 

(b) (y,x)tE f 

There are two subscenarios. 

i. {y,x) G E 

In the situation that (y, x) G E and (y, x) ^ Ef we have 
f(y, x) = Uf(y,x). However, the fulfillment stage of push(x, y) 
pushes the flow through the edge (x, y) and hence adds the re- 
verse edge (y,x) to Ef. Otherwise, the preparation stage 
of relabel(y), which is performed before the fulfillment stage 
of push(x,y), would not consider the edge (y,x) (because 
it would be added to Ef after that stage). Let (y, zq) be 
the edge that affects the new price of y in the preparation 
stage of relabel(y). After the trace, it may occur that the new 
residual edge (y,x) has a lower reduced cost than (y,zo), 
that is c p (y,x) < c p (y,z ) and hence c' p (y,x) < c' p (y,z ). 



27 



Therefore — (c' p (y, x) + e) > — (c' p (y, zq) +e) what implies that 
— (c(y, x)—p(x)+e) > p(y) and hence c(y, x) + p(y) — p{x) < — e 
and Cp(y, x) < — e. Then the residulal edge (y,x) spoils 
the e-optimality of the pseudoflow /. Further we show that 
this situation is temporary and in the next step for the node y, 
f becomes an e-optimal pseudoflow. For the node y, e{y) > 0. 
Moreover, the residual edge (y, x) has the smallest reduced 
cost among all the residual edges outgoing from y. Then 
the push operation towards x can be performed on y and 
Uf(y,x) = min {uf(y, x), e(y)}. Therefore, in the next step 
for the node y, the push operation removes the residual edge 
(y, x) from the residual graph and hence remove the require- 
ment c p (y, x) > — e. 
ii. (y,x) (£ E 

In the situation that (y, x) £ E and (y, x) ^ Ef there must be 
f(x, y) = (if f(x, y) > then u f (y, x) = u(y, x) - f(y, x) = 
u(y,x) + f(x,y) > and then it would be (y,x) G Ef). 
Similarly to the previous case, the residual edge (y, x) will 
be added to Ef in the fulfillment stage of push(x, y) and 
the e-optimality of the flow might be violated because of (y, x). 
However, like in the previous case, the residual edge (y, x) 
will be removed from Ef because of the push operation from 
y towards x which will push the flow of value /(y, x). 

6. applying push(x, y) and push(y, z) operations. 

According to Lemma 5.3 this trace is equivalent to a stage-clean trace, 
where push(x, y) and push(y, z) are performed one by one. Then this 
case can be reduced to the case 2 above. The e-optimality is preserved 
by this trace. 

7. applying push(x, y) and relabel(x) operations. 

According to Lemma 5.2 this two operations cannot be intereleaved 
because the active node can be dealt with only by one operation per 
step of the algorithm. 

8. applying push(x, y) and push(y, x) operations. 

These two operations cannot be intereleaved because the two condi- 
tions c p (x,y) < and c p (y,x) < are mutually exclusive. 

9. applying more operations than two. 

Any trace constisting of more than two operations is interleaved. 
The proof is similar to the above and is omitted. 

□ 
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Lemma 5.6. When the algorithm terminates the pseudoflow f is e-optimal. 



Proof. The initial pseudoflow / is e-optimal. According to lemma 5.5, any trace 
consisting of the flow and/or the price update operations preserve e-optimality 
of /. According to lemma 5.1, the refine procedure can terminate only 
if there is no active node, which implies that the pseudoflow / is an e-optimal 
flow. □ 



5.5 Our Implementation 

Our implementation of the cost scaling algorithm uses the method pre- 
sented in Algorithm 5.4, the global price update heuristic in Algorithm 5.3 
and the arc-fixing heuristic described above. We also use our implementation 
of the lock-free push-relabel algorithm changed appropriately for the cost- 
scaling method. 

After loading the input, the graph is constructed in the host global mem- 
ory. Next, all arrays are copied to the global memory on device. Since 
the copied addresses point to locations in the host memory and the graph can 
be arbitrary we use an additional array to store information about adjacency 
lists of vertices and set valid addresses in the device memory. Futher, while 
running the algorithm only three arrays are copied between host and device: 
the prices, the excesses and the flows. 

In each iteration of the refine loop the host thread calls the push-relabel 
kernel until the pseudoflow / becomes a flow, that is, 
until EXCESS(source) = and EXCESS(sink) = Total, 
where Total = \X\ = \Y\. The kernels are launched by \V\ + 2 threads. 
The control is returned back to the host in two cases. First, after each 
CYCLE iterations, {CYCLE has been preset to 500000). In the second 
case, if EXCESS(source) = and EXCESS(sink) = Total. The first 
case means that the calculation is not finished, but the control is returned 
to the host. The second means that the pseudoflow / becomes a flow. 
In the first case, it is common that the running kernel is interrupted and 
restored without changes in the graph. The purpose of this is to avoid ter- 
minating the execution of the kernel by GPU (launch timed out). This often 
happens when the graph has many edges. 

In our implementation only after the first running of the push-relabel 
kernel the heuristics are performed. They can be executed many times but 
the way we did it yields the best results. Similarly as in [15] we chose 
the value of ALPHA constant equal 10 because in our tests other values 
much extended the running time of the program. The best results of the al- 
gorithm are achived for the thread block of size 32 x 16. 

The first heuristic executed is arc_fixing kernel then globalPriceUpdate 
C procedure. The arc _ fixing kernel deletes edges from the graph residing 
in the device memory by removing them from the adjacency lists and sets 
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the flows on them to —10 (any negative constant would do) in order not 
to free the memory. Then the arrays of flows and prices are copied to the 
host memory. To make the graphs on the device and on the host equal, 
the procedure of the global relabeling deletes from the host graph the edges 
which were previously deleted from the graph on the device memory. They 
are recognized by the value of a flow equal —10. Next the global relabeling is 
performed and further the new prices are copied back to the device memory. 

6 Discussion and Future Work 

In this paper we have presented an efficient implementation of Hong's par- 
allel non-blocking push-relabel algorithm and an implementation of the cost- 
scaling algorithm for the assignment problem. The Refine procedure of the cost- 
scaling algorithm uses the parallel push-relabel lock-free algorithm and runs 
in 0(n 2 m) time, where the complexity is analyzed in respect to the number of 
push and relabel operations. The amount of work performed by the program 
(the total number of instructions executed) is the same as in the sequential 
Algorithm 5.2, i.e. 0(re 2 mmin{log(nC),mlogn}). Both implementations 
run on GPU using CUDA architecutre. 

For our purpose, the cost scaling algorithm for the assignment problem 
is used for the complete bipartite graphs of size |X| = \Y\ <= 30. The exe- 
cution time of our implementation for graphs of this size and costs of edges 
at most 100, is about l/20s which allows for real-time applications. 

The maximum efficiency of an algorithm implemented in CUDA architec- 
ture is achieved when the number of threads is large and all threads execute 
the same task. Otherwise the running time can be equal the running time 
of a sequential algorithm. In the case of the large complete bipartite graphs 
the presented algorithm is not efficient. 

Further reasearch could be made towards the use of the CUDA ar- 
chitecutre to finding a maximum cardinality matching of maximum weight 
for arbitrary (nonbipartite) graphs. 
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